Using interpretative machine learning to analyze spatial distribution of sociodemographic profiles influencing voting patterns in U.S. presidential elections (2008-2024)

24th European Colloquium on Theoretical and Quantitative Geography, Tallinn, Estonia

Adrian Nowacki, Anna Dmowska, Jarosław Jasiewicz
Adam Mickiewicz University in Poznań

09-12-2025

Introduction

Various socio-economic factors play a key role in shaping electoral outcomes. Voting patterns in U.S. presidential elections change over time and space, reflecting society’s responses to shifting socio-economic and political conditions.

The relationship between demographic structure and electoral preferences can be analyzed in two ways:

  • directly

  • indirectly
  • Indirectly obtained data are too complex and highly interdependent to be interpreted with classical methods.
  • To describe the nonlinear relationships between election outcomes and socio-demographic factors, machine learning methods were applied.
  • Many advanced models detect complex patterns without revealing their decision structure, so interpretable machine learning (IML) is used to decompose and explain these relationships.


The use of IML is essential to explain socio-demographic influences on election outcomes.

Project objective


The aim of the project was to investigate spatial socio-demographic profiles influencing voting patterns in the U.S. presidential elections from 2008 to 2024 at the county level.


An equally important goal was to separate spatial voting patterns from changes in voter preferences over time.


Problem:

Difficulty in obtaining interpretable results due to the high spatio-temporal complexity of the data.


Data

The American Community Survey (ACS) provides detailed socio-economic and demographic data about the U.S. population.


Five-year averages used for each election year:

  • 2008 election (ACS 2005-2009)
  • 2012 election (ACS 2008-2012)
  • 2016 election (ACS 2012-2016)
  • 2020 election (ACS 2016-2020)
  • 2024 election (ACS 2020-2024)

tonmcg/US_County_Level_Election_Results_08-24

The dataset (2008–2024) combines official state results, providing vote counts, differences, and percentages for all continental U.S. counties by candidates from:

  • 2008: Barack Obama (Democratic Party) vs John McCain (Republican Party)
  • 2012: Barack Obama (Democratic Party) vs Mitt Romney (Republican Party)
  • 2016: Hillary Clinton (Democratic Party) vs Donald Trump (Republican Party)
  • 2020: Joe Biden (Democratic Party) vs Donald Trump (Republican Party)
  • 2024: Kamala Harris (Democratic Party) vs Donald Trump (Republican Party)

Methods

flowchart TB
  subgraph data_preparation ["I. Data Preparation"]
    A1[/Preparation of demographic variables/]
    A2[/Acquisition of election results/]
  end
  subgraph PCA ["II. Principal Component Analysis"]
    direction LR
    B1[PCA contribution analysis and reduction from 77 to 39 variables]
    B2[Performing PCA on 39 variables]
  end  
    B3[\Set of 12 components\]
    B4[\Table of factor loadings\]
    B5[\Maps of principal component values\]
    B6[\Impact of variables on principal components\]
    A1 --> PCA
    B1 --> B2 
    PCA --> B3
    PCA --> B4
    PCA --> B5
    PCA --> B6
  subgraph Modeling ["III. Machine Learning"]
    direction LR
    C0[Data merging and preparation]
    C1[Model training]
    C2[Model performance evaluation]
    C3[Selection of the best model]
  end
    A2 --merge dataset with 12 components--> Modeling
    B3 --> Modeling
    C0 --> C1 --> C2 --> C3
    Modeling --> SHAP
  subgraph SHAP ["IV. Model Decomposition and Interpretation"]
    D1[Calculation of SHAP values]
  end
    D2[\Variable importance plots\]
    D3[\Dependence plots\]
    D4[\Maps of component impacts\]
    SHAP --> D2
    SHAP --> D3
    SHAP --> D4

Principal Component Analysis (PCA)

The purpose of PCA was to reduce socio-demographic variables to key components describing population profiles, while minimizing information loss, in three stages:

%%{init: {
  "flowchart": { "htmlLabels": true }
}}%%

flowchart LR

    A1["STAGE 1
    
    Preparation of 77 variables &nbsp"]

    A2["STAGE 2
    
    Reduction to 39 variables &nbsp"]

    A3["STAGE 3
    
    Reduction to 12 principal components &nbsp &nbsp"]

     A1 -->|PCA| A2
     A2 -->|PCA| A3

    classDef redBlock fill:#8d4444,stroke:#600,stroke-width:1px,color:#fff;
    class A1,A2,A3 redBlock;

Stage I
  • Selection of variables capturing a wide range of socio-demographic characteristics.
  • From ~1,100 available variables, 77 were extracted, standardized for each election year (2008–2024).
  • Variables grouped into five categories: gender, education, income, race & ethnicity, occupation & employment.
  • Unified dataset with 15,500+ observations enabled stable PCA across years.


The applied approach enabled each principal component to retain the same interpretative meaning across all analyzed years.
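
One way to read the pooling step above, sketched in Python for illustration only (the project itself lists R packages): rows from every election year are stacked into one table and each variable is z-scored over the pooled rows, so a given standardized value means the same thing in 2008 and in 2024. Whether the authors standardized over the pooled table or within each year is not stated; pooled scaling is an assumption here, and `stack_years`, `standardize`, the county keys, and all numbers are hypothetical.

```python
from statistics import mean, pstdev

def stack_years(tables):
    """Stack {year: {county_id: [values]}} into (year, county_id, values) rows."""
    rows = []
    for year in sorted(tables):
        for county_id in sorted(tables[year]):
            rows.append((year, county_id, list(tables[year][county_id])))
    return rows

def standardize(rows):
    """Z-score every variable over the pooled rows (assumed pooled scaling)."""
    n_vars = len(rows[0][2])
    columns = [[values[j] for _, _, values in rows] for j in range(n_vars)]
    means = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]  # population SD; must be non-zero
    return [
        (year, county_id, [(v - m) / s for v, m, s in zip(values, means, sds)])
        for year, county_id, values in rows
    ]

# Hypothetical toy input with two variables per county
tables = {
    2008: {"01001": [20.0, 48.0], "01003": [26.0, 52.0]},
    2024: {"01001": [24.0, 60.0], "01003": [30.0, 64.0]},
}
pooled = standardize(stack_years(tables))
```

Because the mean and standard deviation come from all years at once, a PCA fitted on `pooled` would use a single set of axes for every election.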

Stage II

PCA was applied to the 77-variable dataset, and the 39 variables that contributed most to the variance captured by the principal components were retained.

Stage III

Twelve uncorrelated principal components were retained, together explaining 87.4% of the total variance.
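
The 87.4% figure is a cumulative share of variance: each component's eigenvalue divided by the sum of all eigenvalues, added up over the retained components. A minimal Python sketch of that arithmetic follows. The eigenvalues are made up for illustration, the helper name `explained_variance` is hypothetical, and the rule the authors used to settle on 12 components is not stated in the source.

```python
def explained_variance(eigenvalues):
    """Return per-component and cumulative shares of total variance."""
    total = sum(eigenvalues)
    shares = [value / total for value in eigenvalues]
    cumulative = []
    running = 0.0
    for share in shares:
        running += share
        cumulative.append(running)
    return shares, cumulative

# Hypothetical eigenvalues, sorted from largest to smallest
eigenvalues = [4.0, 3.0, 2.0, 1.0]
shares, cumulative = explained_variance(eigenvalues)

# Smallest number of components whose cumulative share reaches a chosen level
target = 0.85
n_components = next(k + 1 for k, c in enumerate(cumulative) if c >= target)
```

In this toy case the first three components are enough to pass the 85% level.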

Factor loadings were extracted to show how strongly each variable contributes to a principal component.

High positive values represent a strong positive correlation with a component, while high negative values indicate a strong negative correlation.
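
For standardized variables, a loading in this correlation sense can be obtained by scaling an eigenvector entry by the square root of its eigenvalue. The Python sketch below illustrates this on a two-variable case whose eigen-decomposition is known in closed form (a correlation r gives eigenvalues 1 + r and 1 - r). The helper `loadings` is hypothetical and r = 0.6 is an arbitrary value, not a result from the study.

```python
from math import sqrt

def loadings(eigenvectors, eigenvalues):
    """loading[j][k] = eigenvectors[k][j] * sqrt(eigenvalues[k])."""
    n_vars = len(eigenvectors[0])
    return [
        [vec[j] * sqrt(val) for vec, val in zip(eigenvectors, eigenvalues)]
        for j in range(n_vars)
    ]

r = 0.6  # assumed correlation between two standardized variables
eigenvalues = [1 + r, 1 - r]
eigenvectors = [
    [1 / sqrt(2), 1 / sqrt(2)],   # PC1: both variables move together
    [1 / sqrt(2), -1 / sqrt(2)],  # PC2: the variables move in opposite directions
]
table = loadings(eigenvectors, eigenvalues)  # one row per variable, one column per component
```

Both variables load positively on the first component, while on the second they carry opposite signs; the squared loadings of each variable sum to 1 across the components.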

A particular challenge was the precise interpretation of the meaning of individual components as socio-demographic profiles shaping voting patterns.

Detailed interpretation was based on analyzing variable influence on each principal component to identify those with the greatest contribution.

Variables with positive loadings are shown in red, and negative loadings in blue.


For the first principal component, the strongest influence came from the variables contributing more than 6%: higher education and U.S. citizenship by naturalization.
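
A common way to express such percentages, assumed here rather than confirmed by the source, is the squared loading of a variable divided by the sum of squared loadings on that component. With 39 variables, an even split would give each about 2.6% (100 / 39), so a 6% cut-off singles out variables contributing more than twice their even share. The variable names and loadings in this Python sketch are hypothetical.

```python
def contributions(component_loadings):
    """Percentage contribution of each variable to one component."""
    total = sum(value * value for value in component_loadings.values())
    return {
        name: 100.0 * value * value / total
        for name, value in component_loadings.items()
    }

def above_threshold(contrib, threshold=6.0):
    """Variable names whose contribution exceeds the threshold, largest first."""
    selected = [name for name, share in contrib.items() if share > threshold]
    return sorted(selected, key=lambda name: -contrib[name])

# Hypothetical loadings on one component (not values from the study)
pc_loadings = {"var_a": 0.8, "var_b": 0.6, "var_c": -0.1, "var_d": 0.1}
contrib = contributions(pc_loadings)
dominant = above_threshold(contrib)
```

The sign of a loading is lost when it is squared, which is why the slides color positive and negative loadings separately.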

The PCA results also allow for the interpretation of component scores for all records in the dataset (counties).


A high positive component score means that a given county strongly reflects the pattern described by the component (by the variables with high loadings for that component).


A strongly negative score indicates that the county shows the opposite pattern.
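
A component score is the weighted sum of a county's standardized variables, with the component's eigenvector entries as weights. The Python sketch below uses three hypothetical variables and made-up weights to show how two counties with opposite profiles end up with scores of opposite sign.

```python
def component_score(z_values, weights):
    """Weighted sum of standardized variables for one county and one component."""
    return sum(z * w for z, w in zip(z_values, weights))

# Hypothetical unit-length weights: positive on the first two variables, negative on the third
weights = [0.6, 0.64, -0.48]

county_a = [1.5, 1.0, -1.0]   # matches the pattern described by the component
county_b = [-1.5, -1.0, 1.0]  # mirror image of county_a

score_a = component_score(county_a, weights)
score_b = component_score(county_b, weights)
```

Here `score_a` is positive and `score_b` is its negative, which is the contrast the component maps display.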

The first component reflects a socio-demographic profile contrasting poorer, less educated, and homogeneous peripheral counties with wealthier, educated, urban, and diverse ones.


%%{init: {"flowchart": {"htmlLabels": true}} }%%
flowchart TB
    A1[Factor loadings of variables &nbsp &nbsp]
    A2[Impact of variables on principal components &nbsp &nbsp]
    A3[Spatial distribution of principal component values &nbsp &nbsp]
    A4[Interpretation of socio-demographic profiles &nbsp &nbsp]
    A1 --> A4
    A2 --> A4
    A3 --> A4
    classDef redBlock fill:#8d4444,stroke:#600,stroke-width:1px,color:#fff;
    class A1,A2,A3,A4 redBlock;

Each of the 12 components from PCA results was described as part of broader socio-demographic profiles of voters, which provided a stronger foundation for interpreting the machine learning model.

Machine learning

The goal was not prediction but examining how the 12 principal components (predictors) influenced election outcomes, measured as the vote difference between the Democratic (<0) and Republican (>0) candidates.
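
The sign convention can be made concrete with a small Python sketch. It assumes the outcome is the Republican minus Democratic difference expressed in percentage points of all votes cast; the source does not say whether raw counts or percentages were modeled, and `vote_margin` and the vote counts are hypothetical.

```python
def vote_margin(votes_dem, votes_rep, total_votes):
    """Republican minus Democratic votes, in percentage points of all votes."""
    if total_votes <= 0:
        raise ValueError("total_votes must be positive")
    return 100.0 * (votes_rep - votes_dem) / total_votes

# Hypothetical counties
urban = vote_margin(votes_dem=60_000, votes_rep=38_000, total_votes=100_000)
rural = vote_margin(votes_dem=3_000, votes_rep=6_500, total_votes=10_000)
```

A Democratic-leaning county such as `urban` gets a negative value and a Republican-leaning county such as `rural` a positive one, matching the (<0) and (>0) labels above.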

The XGBoost model is highly complex and requires interpretable machine learning to understand its internal structure.

Model Decomposition and Interpretation

The computation of SHAP values made it possible to assess the importance of the principal components, highlighting their contribution to model results in each election year.
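
A common summary of importance, assumed here since the source does not name the statistic, is the mean absolute SHAP value of each component, computed separately for each election year. In this Python sketch `importance_by_year` is a hypothetical helper and the SHAP values are invented.

```python
from collections import defaultdict

def importance_by_year(records):
    """records: iterable of (year, [SHAP value per component]).

    Returns {year: [mean absolute SHAP value per component]}.
    """
    sums = {}
    counts = defaultdict(int)
    for year, shap_values in records:
        if year not in sums:
            sums[year] = [0.0] * len(shap_values)
        for k, value in enumerate(shap_values):
            sums[year][k] += abs(value)
        counts[year] += 1
    return {year: [s / counts[year] for s in totals] for year, totals in sums.items()}

# Hypothetical SHAP values for two components in two counties per year
records = [
    (2012, [4.0, -2.0]),
    (2012, [-2.0, 2.0]),
    (2024, [9.0, -1.0]),
    (2024, [-7.0, 1.0]),
]
importance = importance_by_year(records)
```

Taking absolute values before averaging keeps strong Democratic-leaning and strong Republican-leaning effects from cancelling each other out.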

Using SHAP, dependence plots were generated to show how principal component values (x-axis) relate to the model outcome (y-axis), with each point representing a county.


SHAP values further above zero indicate stronger predicted support for Republicans, while values further below zero indicate stronger predicted support for Democrats.


These results can be directly linked with impact maps, which enable a spatial interpretation of how individual principal components influence the predicted support for candidates.
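
The SHAP values for a county add up to the difference between its prediction and a baseline, which is what makes the sign of each value readable as a push toward one party. The Python sketch below computes exact Shapley values for a toy three-component model by averaging marginal contributions over all feature orderings. It is a simplification: `toy_model`, its coefficients, and the single all-zero reference point are hypothetical, whereas SHAP tooling for tree models averages over a background dataset and uses faster algorithms.

```python
from itertools import permutations

def toy_model(pc):
    """Hypothetical predicted margin (Republican > 0) from three component scores."""
    return 10.0 * pc[0] - 4.0 * pc[1] + 2.0 * pc[0] * pc[2]

def shapley_values(model, x, reference):
    """Exact Shapley values of model(x) relative to model(reference)."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(reference)
        previous = model(current)
        for j in order:
            current[j] = x[j]  # switch feature j from reference to actual value
            updated = model(current)
            phi[j] += updated - previous
            previous = updated
    return [value / len(orderings) for value in phi]

county = [1.0, 0.5, -1.0]
reference = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, county, reference)
baseline = toy_model(reference)
prediction = toy_model(county)
```

In this toy case the first component pushes the prediction toward the Republican side and the other two pull it toward the Democratic side, and `baseline + sum(phi)` reproduces `prediction`.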

Component 1: Poorer, less-educated, peripheral counties vs Urban, educated, wealthier counties

Component 12: Affluent, married, predominantly white counties vs Younger, poorer, less-educated, predominantly Black counties

Summary

  • The use of IML revealed links between socio-demographic factors and electoral outcomes.

  • The results enabled a detailed description of the spatial socio-demographic profiles of voters.

  • Of the 12 components, five (1, 2, 3, 5, and 12) had the strongest impact on voting patterns in 2008–2024.

  • Since 2016, Component 1 has gained importance, while others have weakened.

  • After 2012, Component 12 steadily lost influence.

Authors: Adrian Nowacki, Anna Dmowska, Jarosław Jasiewicz

Presentation: https://adrian-nowacki.github.io/ectqg2025

Software: R packages: mlr3, iml, shapviz

Next steps:
  • Detailed interpretation of the identified socio-demographic profiles.
  • Analysis and interpretation of clustering results for profiles with similar voting patterns.